This is my course diary for the IODS course. Feeling good. Hoping to learn some R.
Here is my github project: https://github.com/ukkelinkamu/IODS-project
I’m somewhat familiar with R and ggplot, so this was not too difficult. Still, I had to think a little before I could get exercise 4.6.5 done (below). I also liked the final graphs in chapter 4; maybe I could borrow some ideas from there for my own work. R Markdown is a great way to approach this, I think.
date()
## [1] "Mon Dec 11 13:39:27 2023"
We have learned the basics of fitting regression models and summarising data.
#I created a csv file in the previous task so I start by reading it
df <- read.csv("a_learn14.csv")
df <- df[,-1]
#exploring the structure and dimensions
str(df)
## 'data.frame': 166 obs. of 7 variables:
## $ gender : chr "F" "M" "F" "M" ...
## $ Age : int 53 55 49 53 49 38 50 37 37 42 ...
## $ attitude: num 3.7 3.1 2.5 3.5 3.7 3.8 3.5 2.9 3.8 2.1 ...
## $ deep : num 3.58 2.92 3.5 3.5 3.67 ...
## $ stra : num 3.38 2.75 3.62 3.12 3.62 ...
## $ surf : num 2.58 3.17 2.25 2.25 2.83 ...
## $ Points : int 25 12 24 10 22 21 21 31 24 26 ...
dim(df)
## [1] 166 7
#There are 166 rows and 7 columns. The data come from a student survey. The survey questions covered three learning approaches, averaged here into the columns deep, stra and surf; attitude is a separate composite score. There are also columns for age, gender and exam points.
#access the GGally and ggplot2 libraries
library(GGally)
## Loading required package: ggplot2
## Registered S3 method overwritten by 'GGally':
## method from
## +.gg ggplot2
library(ggplot2)
pairs(df[-1])
summary(df)
## gender Age attitude deep
## Length:166 Min. :17.00 Min. :1.400 Min. :1.583
## Class :character 1st Qu.:21.00 1st Qu.:2.600 1st Qu.:3.333
## Mode :character Median :22.00 Median :3.200 Median :3.667
## Mean :25.51 Mean :3.143 Mean :3.680
## 3rd Qu.:27.00 3rd Qu.:3.700 3rd Qu.:4.083
## Max. :55.00 Max. :5.000 Max. :4.917
## stra surf Points
## Min. :1.250 Min. :1.583 Min. : 7.00
## 1st Qu.:2.625 1st Qu.:2.417 1st Qu.:19.00
## Median :3.188 Median :2.833 Median :23.00
## Mean :3.121 Mean :2.787 Mean :22.72
## 3rd Qu.:3.625 3rd Qu.:3.167 3rd Qu.:27.75
## Max. :5.000 Max. :4.333 Max. :33.00
p <- ggpairs(df, lower = list(combo = wrap("facethist", bins = 20)))
p
#Respondents are young, mean 25.5 years, and about two thirds of them are female. Exam points correlate most strongly with the attitude score.
# fit a linear model
my_model <- lm(Points ~ attitude+stra+surf, df)
# print out a summary of the model
my_model
##
## Call:
## lm(formula = Points ~ attitude + stra + surf, data = df)
##
## Coefficients:
## (Intercept) attitude stra surf
## 11.0171 3.3952 0.8531 -0.5861
summary(my_model)
##
## Call:
## lm(formula = Points ~ attitude + stra + surf, data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -17.1550 -3.4346 0.5156 3.6401 10.8952
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 11.0171 3.6837 2.991 0.00322 **
## attitude 3.3952 0.5741 5.913 1.93e-08 ***
## stra 0.8531 0.5416 1.575 0.11716
## surf -0.5861 0.8014 -0.731 0.46563
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.296 on 162 degrees of freedom
## Multiple R-squared: 0.2074, Adjusted R-squared: 0.1927
## F-statistic: 14.13 on 3 and 162 DF, p-value: 3.156e-08
#attitude is strongly and statistically significantly associated with points: the regression coefficient is 3.3952 with p-value 1.93e-08. The other explanatory variables are not statistically significant.
#new model without the explanatory variables that were not statistically significant
my_model <- lm(Points ~ attitude, df)
#multiple R-squared of this model is 0.1906, which is fairly low: attitude alone explains only about 19% of the variance in exam points, so the data points sit far from the regression line on average. This also means the model would predict a participant's exam points from their attitude rather poorly.
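To illustrate how R-squared reflects how tightly the points hug the regression line, here is a small simulated sketch (synthetic data, not the course dataset): the same linear signal with more noise yields a clearly lower R-squared.

```r
set.seed(42)
x <- rnorm(100)
# identical linear signal, two different noise levels (made-up data)
y_clean <- 2 + 3 * x + rnorm(100, sd = 1)
y_noisy <- 2 + 3 * x + rnorm(100, sd = 6)
r2_clean <- summary(lm(y_clean ~ x))$r.squared
r2_noisy <- summary(lm(y_noisy ~ x))$r.squared
c(r2_clean, r2_noisy)  # the noisier fit has the lower R-squared
```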
#diagnostic plots
plot(my_model, which = c(1,2,5))
#the diagnostic plots do not show obvious problems: the residuals are evenly scattered around zero and the red trend line has no large curves, so the model assumptions seem reasonable.
date()
## [1] "Mon Dec 11 13:39:32 2023"
Here we go again…
We have learned the basics of logistic regression.
First, read the data file. It covers 370 respondents and 36 variables on student characteristics, test performance and alcohol use.
df <- read.csv("chap3.csv")
colnames(df)
## [1] "X" "school" "sex" "age" "address"
## [6] "famsize" "Pstatus" "Medu" "Fedu" "Mjob"
## [11] "Fjob" "reason" "guardian" "traveltime" "studytime"
## [16] "schoolsup" "famsup" "activities" "nursery" "higher"
## [21] "internet" "romantic" "famrel" "freetime" "goout"
## [26] "Dalc" "Walc" "health" "failures" "paid"
## [31] "absences" "G1" "G2" "G3" "alc_use"
## [36] "high_use"
My hypothesis is that absences, failures, higher age and male sex are associated with high alcohol use.
Participants are 15-22 years old. The distribution of absences is right-skewed: most students have only a few, while a handful have many. Most students also have zero failures.
Judging from the graphs, sex, absences and age seem to be associated with high alcohol use; failures are rare and harder to assess from these plots.
library(ggplot2)
library(tidyr)
#defining interesting variables
var <- c("failures", "absences", "age", "sex")
summary(df)
## X school sex age
## Min. : 1.00 Length:370 Length:370 Min. :15.00
## 1st Qu.: 93.25 Class :character Class :character 1st Qu.:16.00
## Median :185.50 Mode :character Mode :character Median :17.00
## Mean :185.50 Mean :16.58
## 3rd Qu.:277.75 3rd Qu.:17.00
## Max. :370.00 Max. :22.00
## address famsize Pstatus Medu
## Length:370 Length:370 Length:370 Min. :0.0
## Class :character Class :character Class :character 1st Qu.:2.0
## Mode :character Mode :character Mode :character Median :3.0
## Mean :2.8
## 3rd Qu.:4.0
## Max. :4.0
## Fedu Mjob Fjob reason
## Min. :0.000 Length:370 Length:370 Length:370
## 1st Qu.:2.000 Class :character Class :character Class :character
## Median :3.000 Mode :character Mode :character Mode :character
## Mean :2.557
## 3rd Qu.:3.750
## Max. :4.000
## guardian traveltime studytime schoolsup
## Length:370 Min. :1.000 Min. :1.000 Length:370
## Class :character 1st Qu.:1.000 1st Qu.:1.000 Class :character
## Mode :character Median :1.000 Median :2.000 Mode :character
## Mean :1.446 Mean :2.043
## 3rd Qu.:2.000 3rd Qu.:2.000
## Max. :4.000 Max. :4.000
## famsup activities nursery higher
## Length:370 Length:370 Length:370 Length:370
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
## internet romantic famrel freetime
## Length:370 Length:370 Min. :1.000 Min. :1.000
## Class :character Class :character 1st Qu.:4.000 1st Qu.:3.000
## Mode :character Mode :character Median :4.000 Median :3.000
## Mean :3.935 Mean :3.224
## 3rd Qu.:5.000 3rd Qu.:4.000
## Max. :5.000 Max. :5.000
## goout Dalc Walc health
## Min. :1.000 Min. :1.000 Min. :1.000 Min. :1.000
## 1st Qu.:2.000 1st Qu.:1.000 1st Qu.:1.000 1st Qu.:3.000
## Median :3.000 Median :1.000 Median :2.000 Median :4.000
## Mean :3.116 Mean :1.484 Mean :2.295 Mean :3.562
## 3rd Qu.:4.000 3rd Qu.:2.000 3rd Qu.:3.000 3rd Qu.:5.000
## Max. :5.000 Max. :5.000 Max. :5.000 Max. :5.000
## failures paid absences G1
## Min. :0.0000 Length:370 Min. : 0.000 Min. : 2.00
## 1st Qu.:0.0000 Class :character 1st Qu.: 1.000 1st Qu.:10.00
## Median :0.0000 Mode :character Median : 3.000 Median :12.00
## Mean :0.1892 Mean : 4.511 Mean :11.52
## 3rd Qu.:0.0000 3rd Qu.: 6.000 3rd Qu.:14.00
## Max. :3.0000 Max. :45.000 Max. :18.00
## G2 G3 alc_use high_use
## Min. : 4.0 Min. : 0.00 Min. :1.000 Mode :logical
## 1st Qu.:10.0 1st Qu.:10.00 1st Qu.:1.000 FALSE:259
## Median :12.0 Median :12.00 Median :1.500 TRUE :111
## Mean :11.5 Mean :11.52 Mean :1.889
## 3rd Qu.:14.0 3rd Qu.:14.00 3rd Qu.:2.500
## Max. :18.0 Max. :18.00 Max. :5.000
#drawing graphs of the variables of interest
gather(df[,var]) %>% ggplot(aes(value)) + geom_bar() + facet_wrap("key", scales = "free")
#exploring associations with alcohol use
g1 <- ggplot(df, aes(x = high_use, col = sex))
g2 <- ggplot(df, aes(x = high_use, y = absences))
g3 <- ggplot(df, aes(x = high_use, y = age))
g4 <- ggplot(df, aes(x = high_use, y = failures))
g1 + geom_bar()
g2 + geom_boxplot()
g3 + geom_boxplot()
g4 + geom_boxplot()
Next, I define the model with the chosen variables.
The summary shows that all variables except age are statistically significantly associated with high alcohol use. The same can be read from the odds ratios and the coefficients. The largest effect is for sex: male sex carries higher odds of high alcohol use.
#defining the model
model <- glm(high_use ~ absences + sex + age + failures, data = df, family = "binomial")
summary(model)
##
## Call:
## glm(formula = high_use ~ absences + sex + age + failures, family = "binomial",
## data = df)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.1832 -0.8265 -0.6036 1.0200 2.1041
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -3.72535 1.74828 -2.131 0.033100 *
## absences 0.09019 0.02334 3.864 0.000111 ***
## sexM 1.00045 0.24754 4.042 5.31e-05 ***
## age 0.10850 0.10505 1.033 0.301694
## failures 0.56451 0.21098 2.676 0.007459 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 452.04 on 369 degrees of freedom
## Residual deviance: 405.92 on 365 degrees of freedom
## AIC: 415.92
##
## Number of Fisher Scoring iterations: 4
coef(model)
## (Intercept) absences sexM age failures
## -3.72534682 0.09019383 1.00045285 0.10849680 0.56450896
# compute odds ratios (OR)
OR <- coef(model) %>% exp
# compute confidence intervals (CI)
CI <- confint(model) %>% exp
## Waiting for profiling to be done...
# print out the odds ratios with their confidence intervals
cbind(OR, CI)
## OR 2.5 % 97.5 %
## (Intercept) 0.02410474 0.0007493589 0.7203136
## absences 1.09438639 1.0477584017 1.1483784
## sexM 2.71951309 1.6839636172 4.4530085
## age 1.11460134 0.9076445049 1.3713869
## failures 1.75858403 1.1692521450 2.6916472
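As a sanity check on the OR table: an odds ratio is just the exponentiated logistic-regression coefficient. Taking the sexM coefficient printed above:

```r
# exponentiating the sexM log-odds coefficient (1.00045285 in the summary)
or_sexM <- exp(1.00045285)
or_sexM  # about 2.7195, matching the sexM row of the OR table
```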
Next, I calculate the probability of high alcohol use from the new model. Comparing the predictions with the true values shows that the model predicts only moderately well whether a participant is a high user or not: it classifies about 77% of the students correctly (285 of 370 in the cross-tabulation), so roughly 23% of predictions go wrong. For reference, always guessing "not a high user" would already be right 70% of the time, since 30% of the students are high users.
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
#new model with the significant variables
model <- glm(high_use ~ absences + sex + failures, data = df, family = "binomial")
# predict() the probability of high_use
probabilities <- predict(model, type = "response")
df <- mutate(df, probability = probabilities)
df <- mutate(df, prediction = probability > 0.5)
# initialize a plot of 'high_use' versus 'probability' in 'df'
g <- ggplot(df, aes(x = high_use, y = probability))
# define the geom as points and draw the plot
g + geom_point()
# tabulate the target variable versus the predictions
table(high_use = df$high_use, prediction = df$prediction)
## prediction
## high_use FALSE TRUE
## FALSE 252 7
## TRUE 78 33
# define a loss function (mean prediction error)
loss_func <- function(class, prob) {
n_wrong <- abs(class - prob) > 0.5
mean(n_wrong)
}
# call loss_func to compute the average number of wrong predictions in the (training) data
loss_func(class = df$high_use, prob = df$probability)
## [1] 0.2297297
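To see exactly what the loss function measures, here is a hand-checkable toy call (made-up vectors, not the course data): with one TRUE in four observations, predicting probability 0 for everyone is wrong only for the TRUE case.

```r
# mean prediction error: share of observations misclassified at the 0.5 threshold
loss_func <- function(class, prob) {
  n_wrong <- abs(class - prob) > 0.5
  mean(n_wrong)
}
loss_func(class = c(TRUE, FALSE, FALSE, FALSE), prob = 0)  # 0.25
```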
First, I loaded the Boston dataset, which contains housing data for suburbs of Boston, including characteristics of the areas such as crime rate, age of the buildings and average number of rooms.
The dataset has 506 rows and 14 columns.
# access the MASS package
library(MASS)
##
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
##
## select
# load the data
data("Boston")
#structure and dimensions
str(Boston)
## 'data.frame': 506 obs. of 14 variables:
## $ crim : num 0.00632 0.02731 0.02729 0.03237 0.06905 ...
## $ zn : num 18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
## $ indus : num 2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
## $ chas : int 0 0 0 0 0 0 0 0 0 0 ...
## $ nox : num 0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
## $ rm : num 6.58 6.42 7.18 7 7.15 ...
## $ age : num 65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
## $ dis : num 4.09 4.97 4.97 6.06 6.06 ...
## $ rad : int 1 2 2 3 3 3 5 5 5 5 ...
## $ tax : num 296 242 242 222 222 222 311 311 311 311 ...
## $ ptratio: num 15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
## $ black : num 397 397 393 395 397 ...
## $ lstat : num 4.98 9.14 4.03 2.94 5.33 ...
## $ medv : num 24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...
dim(Boston)
## [1] 506 14
Next, I take a summary and a graphical look at the data.
Distance to employment centres (dis) has a strong negative correlation with many variables, especially indus, nox and age. Positive correlations are more scattered.
Plotting the median price (medv) against other variables shows that, unsurprisingly, the number of rooms has the strongest positive association.
#summary
summary(Boston)
## crim zn indus chas
## Min. : 0.00632 Min. : 0.00 Min. : 0.46 Min. :0.00000
## 1st Qu.: 0.08205 1st Qu.: 0.00 1st Qu.: 5.19 1st Qu.:0.00000
## Median : 0.25651 Median : 0.00 Median : 9.69 Median :0.00000
## Mean : 3.61352 Mean : 11.36 Mean :11.14 Mean :0.06917
## 3rd Qu.: 3.67708 3rd Qu.: 12.50 3rd Qu.:18.10 3rd Qu.:0.00000
## Max. :88.97620 Max. :100.00 Max. :27.74 Max. :1.00000
## nox rm age dis
## Min. :0.3850 Min. :3.561 Min. : 2.90 Min. : 1.130
## 1st Qu.:0.4490 1st Qu.:5.886 1st Qu.: 45.02 1st Qu.: 2.100
## Median :0.5380 Median :6.208 Median : 77.50 Median : 3.207
## Mean :0.5547 Mean :6.285 Mean : 68.57 Mean : 3.795
## 3rd Qu.:0.6240 3rd Qu.:6.623 3rd Qu.: 94.08 3rd Qu.: 5.188
## Max. :0.8710 Max. :8.780 Max. :100.00 Max. :12.127
## rad tax ptratio black
## Min. : 1.000 Min. :187.0 Min. :12.60 Min. : 0.32
## 1st Qu.: 4.000 1st Qu.:279.0 1st Qu.:17.40 1st Qu.:375.38
## Median : 5.000 Median :330.0 Median :19.05 Median :391.44
## Mean : 9.549 Mean :408.2 Mean :18.46 Mean :356.67
## 3rd Qu.:24.000 3rd Qu.:666.0 3rd Qu.:20.20 3rd Qu.:396.23
## Max. :24.000 Max. :711.0 Max. :22.00 Max. :396.90
## lstat medv
## Min. : 1.73 Min. : 5.00
## 1st Qu.: 6.95 1st Qu.:17.02
## Median :11.36 Median :21.20
## Mean :12.65 Mean :22.53
## 3rd Qu.:16.95 3rd Qu.:25.00
## Max. :37.97 Max. :50.00
# calculate the correlation matrix
cor_matrix <- cor(Boston)
# print the correlation matrix
cor_matrix
## crim zn indus chas nox
## crim 1.00000000 -0.20046922 0.40658341 -0.055891582 0.42097171
## zn -0.20046922 1.00000000 -0.53382819 -0.042696719 -0.51660371
## indus 0.40658341 -0.53382819 1.00000000 0.062938027 0.76365145
## chas -0.05589158 -0.04269672 0.06293803 1.000000000 0.09120281
## nox 0.42097171 -0.51660371 0.76365145 0.091202807 1.00000000
## rm -0.21924670 0.31199059 -0.39167585 0.091251225 -0.30218819
## age 0.35273425 -0.56953734 0.64477851 0.086517774 0.73147010
## dis -0.37967009 0.66440822 -0.70802699 -0.099175780 -0.76923011
## rad 0.62550515 -0.31194783 0.59512927 -0.007368241 0.61144056
## tax 0.58276431 -0.31456332 0.72076018 -0.035586518 0.66802320
## ptratio 0.28994558 -0.39167855 0.38324756 -0.121515174 0.18893268
## black -0.38506394 0.17552032 -0.35697654 0.048788485 -0.38005064
## lstat 0.45562148 -0.41299457 0.60379972 -0.053929298 0.59087892
## medv -0.38830461 0.36044534 -0.48372516 0.175260177 -0.42732077
## rm age dis rad tax ptratio
## crim -0.21924670 0.35273425 -0.37967009 0.625505145 0.58276431 0.2899456
## zn 0.31199059 -0.56953734 0.66440822 -0.311947826 -0.31456332 -0.3916785
## indus -0.39167585 0.64477851 -0.70802699 0.595129275 0.72076018 0.3832476
## chas 0.09125123 0.08651777 -0.09917578 -0.007368241 -0.03558652 -0.1215152
## nox -0.30218819 0.73147010 -0.76923011 0.611440563 0.66802320 0.1889327
## rm 1.00000000 -0.24026493 0.20524621 -0.209846668 -0.29204783 -0.3555015
## age -0.24026493 1.00000000 -0.74788054 0.456022452 0.50645559 0.2615150
## dis 0.20524621 -0.74788054 1.00000000 -0.494587930 -0.53443158 -0.2324705
## rad -0.20984667 0.45602245 -0.49458793 1.000000000 0.91022819 0.4647412
## tax -0.29204783 0.50645559 -0.53443158 0.910228189 1.00000000 0.4608530
## ptratio -0.35550149 0.26151501 -0.23247054 0.464741179 0.46085304 1.0000000
## black 0.12806864 -0.27353398 0.29151167 -0.444412816 -0.44180801 -0.1773833
## lstat -0.61380827 0.60233853 -0.49699583 0.488676335 0.54399341 0.3740443
## medv 0.69535995 -0.37695457 0.24992873 -0.381626231 -0.46853593 -0.5077867
## black lstat medv
## crim -0.38506394 0.4556215 -0.3883046
## zn 0.17552032 -0.4129946 0.3604453
## indus -0.35697654 0.6037997 -0.4837252
## chas 0.04878848 -0.0539293 0.1752602
## nox -0.38005064 0.5908789 -0.4273208
## rm 0.12806864 -0.6138083 0.6953599
## age -0.27353398 0.6023385 -0.3769546
## dis 0.29151167 -0.4969958 0.2499287
## rad -0.44441282 0.4886763 -0.3816262
## tax -0.44180801 0.5439934 -0.4685359
## ptratio -0.17738330 0.3740443 -0.5077867
## black 1.00000000 -0.3660869 0.3334608
## lstat -0.36608690 1.0000000 -0.7376627
## medv 0.33346082 -0.7376627 1.0000000
# visualize the correlation matrix
library(corrplot)
## corrplot 0.92 loaded
corrplot(cor_matrix, method="circle")
#also exploring distributions in some of the key variables
library(ggplot2)
library(tidyr)
var <- c("crim","age","rm","dis","medv")
df <- Boston
gather(df[,var]) %>% ggplot(aes(value)) + geom_bar() + facet_wrap("key", scales = "free")
#exploring associations with the median house price (medv)
g1 <- ggplot(df, aes(x = medv, y = age))
g2 <- ggplot(df, aes(x = medv, y = crim))
g3 <- ggplot(df, aes(x = medv, y = rm))
g4 <- ggplot(df, aes(x = medv, y = dis))
g1 + geom_point()
g2 + geom_count()
g3 + geom_point()
g4 + geom_point()
Next, I scale (standardize) the dataset and create a categorical variable from quantiles of the crime rate. Scaling subtracts each variable's mean and divides by its standard deviation, so the values express the distance from the average in standard deviations.
The scaled variables are now distributed around zero, since they measure deviation from the mean rather than the absolute value.
Lastly, I split the data into a training set with a random 80% of the rows and a test set with the remaining 20%.
#scaling
s_df <- scale(df)
summary(s_df)
## crim zn indus chas
## Min. :-0.419367 Min. :-0.48724 Min. :-1.5563 Min. :-0.2723
## 1st Qu.:-0.410563 1st Qu.:-0.48724 1st Qu.:-0.8668 1st Qu.:-0.2723
## Median :-0.390280 Median :-0.48724 Median :-0.2109 Median :-0.2723
## Mean : 0.000000 Mean : 0.00000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.007389 3rd Qu.: 0.04872 3rd Qu.: 1.0150 3rd Qu.:-0.2723
## Max. : 9.924110 Max. : 3.80047 Max. : 2.4202 Max. : 3.6648
## nox rm age dis
## Min. :-1.4644 Min. :-3.8764 Min. :-2.3331 Min. :-1.2658
## 1st Qu.:-0.9121 1st Qu.:-0.5681 1st Qu.:-0.8366 1st Qu.:-0.8049
## Median :-0.1441 Median :-0.1084 Median : 0.3171 Median :-0.2790
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.5981 3rd Qu.: 0.4823 3rd Qu.: 0.9059 3rd Qu.: 0.6617
## Max. : 2.7296 Max. : 3.5515 Max. : 1.1164 Max. : 3.9566
## rad tax ptratio black
## Min. :-0.9819 Min. :-1.3127 Min. :-2.7047 Min. :-3.9033
## 1st Qu.:-0.6373 1st Qu.:-0.7668 1st Qu.:-0.4876 1st Qu.: 0.2049
## Median :-0.5225 Median :-0.4642 Median : 0.2746 Median : 0.3808
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 1.6596 3rd Qu.: 1.5294 3rd Qu.: 0.8058 3rd Qu.: 0.4332
## Max. : 1.6596 Max. : 1.7964 Max. : 1.6372 Max. : 0.4406
## lstat medv
## Min. :-1.5296 Min. :-1.9063
## 1st Qu.:-0.7986 1st Qu.:-0.5989
## Median :-0.1811 Median :-0.1449
## Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.6024 3rd Qu.: 0.2683
## Max. : 3.5453 Max. : 2.9865
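The z-score definition behind scale() can be verified on a toy vector (made-up numbers, not the Boston data): subtracting the mean and dividing by the standard deviation gives the same values.

```r
x <- c(2, 4, 6, 8)
manual <- (x - mean(x)) / sd(x)  # (value - mean) / standard deviation
cbind(scaled = as.numeric(scale(x)), manual)  # the two columns agree
```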
#scale() returns a matrix, so convert it to a data frame
s_df <- as.data.frame(scale(Boston))
#creating the categorical variable - first find the quantiles
bins <- quantile(s_df$crim)
bins
## 0% 25% 50% 75% 100%
## -0.419366929 -0.410563278 -0.390280295 0.007389247 9.924109610
# create a categorical variable 'crime'
crime <- cut(s_df$crim, breaks = bins, include.lowest = TRUE)
#adding the variable to the dataset
s_df <- dplyr::select(s_df, -crim)
s_df <- data.frame(s_df, crime)
#relabelling the categories as low, med_low, med_high and high
s_df$crime1 <- ifelse(s_df$crime == "[-0.419,-0.411]","low",ifelse(s_df$crime == "(-0.411,-0.39]","med_low",
ifelse(s_df$crime == "(-0.39,0.00739]","med_high","high")))
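As an aside, the ifelse chain above can be avoided: cut() accepts a labels argument, so the categories can be named in one step (a sketch on made-up numbers):

```r
x <- c(0.1, 0.5, 2, 7, 9, 15, 21, 30)
crime_toy <- cut(x, breaks = quantile(x),
                 labels = c("low", "med_low", "med_high", "high"),
                 include.lowest = TRUE)
table(crime_toy)  # two observations fall in each quartile bin
```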
#choosing a random 80% of the rows
ind <- sample(nrow(s_df), size = nrow(s_df) * 0.8)
#train dataset including 80% of rows
train <- s_df[ind,]
# create test set with remaining
test <- s_df[-ind,]
Next, I performed a linear discriminant analysis.
rad (index of accessibility to radial highways) has the largest coefficient and so contributes most to the first discriminant. The first linear discriminant carries about 95% of the separating power between the categories (proportion of trace).
The biplot shows the same phenomenon: on the x-axis, LD1 (the first discriminant) clearly separates the categories, and the arrow for rad is the longest, so it has the most influence on the separation.
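The "proportion of trace" that lda() prints can be recomputed from the fitted object's singular values; here is a self-contained sketch with the iris data (standing in for the Boston fit above):

```r
library(MASS)  # for lda()
fit <- lda(Species ~ ., data = iris)
# the printed "Proportion of trace" is svd^2 normalised to sum to one
prop <- fit$svd^2 / sum(fit$svd^2)
round(prop, 4)  # LD1 carries almost all of the separating power
```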
#formula for the analysis
colnames(train)
## [1] "zn" "indus" "chas" "nox" "rm" "age" "dis"
## [8] "rad" "tax" "ptratio" "black" "lstat" "medv" "crime"
## [15] "crime1"
lda.fit <- lda(crime ~ zn + indus + chas + nox + rm + age + dis +
rad + tax + ptratio + black + lstat + medv, data = train)
lda.fit
## Call:
## lda(crime ~ zn + indus + chas + nox + rm + age + dis + rad +
## tax + ptratio + black + lstat + medv, data = train)
##
## Prior probabilities of groups:
## [-0.419,-0.411] (-0.411,-0.39] (-0.39,0.00739] (0.00739,9.92]
## 0.2549505 0.2425743 0.2574257 0.2450495
##
## Group means:
## zn indus chas nox rm
## [-0.419,-0.411] 1.0163736 -0.8885843 -0.11943197 -0.8890101 0.40394961
## (-0.411,-0.39] -0.1278160 -0.3222216 -0.07145661 -0.5596345 -0.17298546
## (-0.39,0.00739] -0.3866438 0.2558808 0.21980846 0.4303870 0.08250778
## (0.00739,9.92] -0.4872402 1.0149946 -0.03371693 1.0572965 -0.47757670
## age dis rad tax ptratio
## [-0.419,-0.411] -0.9403249 0.9084867 -0.6863918 -0.7178521 -0.41894352
## (-0.411,-0.39] -0.3466774 0.3301839 -0.5435787 -0.5117410 -0.09540841
## (-0.39,0.00739] 0.4278798 -0.3912372 -0.4197849 -0.2834153 -0.25305084
## (0.00739,9.92] 0.7787188 -0.8391287 1.6596029 1.5294129 0.80577843
## black lstat medv
## [-0.419,-0.411] 0.37656652 -0.77644152 0.49300127
## (-0.411,-0.39] 0.31791413 -0.13053310 -0.01410717
## (-0.39,0.00739] 0.03709248 0.01820649 0.15691409
## (0.00739,9.92] -0.74112383 0.85280978 -0.64144898
##
## Coefficients of linear discriminants:
## LD1 LD2 LD3
## zn 0.112281617 0.65955925 -1.01720763
## indus 0.046554090 -0.23237937 0.17453319
## chas -0.068330251 -0.09313893 0.04560553
## nox 0.229932955 -0.78579773 -1.35855537
## rm -0.151941724 -0.12171536 -0.24712574
## age 0.257458308 -0.31695461 -0.05714234
## dis -0.123951142 -0.25492166 0.11076816
## rad 3.440568631 1.03322531 0.25171683
## tax 0.009648052 -0.07146369 0.25722303
## ptratio 0.135580115 -0.01150357 -0.43735711
## black -0.136497949 0.02398717 0.14336826
## lstat 0.163378390 -0.28908112 0.28255371
## medv 0.198842140 -0.41231128 -0.26618555
##
## Proportion of trace:
## LD1 LD2 LD3
## 0.9464 0.0410 0.0126
# the function for lda biplot arrows
lda.arrows <- function(x, myscale = 1, arrow_heads = 0.1, color = "red", tex = 0.75, choices = c(1,2)){
heads <- coef(x)
graphics::arrows(x0 = 0, y0 = 0,
x1 = myscale * heads[,choices[1]],
y1 = myscale * heads[,choices[2]], col=color, length = arrow_heads)
text(myscale * heads[,choices], labels = row.names(heads),
cex = tex, col=color, pos=3)
}
# target classes as numeric
classes <- as.numeric(train$crime)
# plot the lda results
plot(lda.fit, dimen = 2)
lda.arrows(lda.fit, myscale = 1)
Next, I test whether the model predicts the crime categories well. The category labels were already saved to the crime1 variable, so I do not have to recreate them here.
All high-crime areas were predicted correctly, and the med_low category was also predicted fairly well; the other two categories fared worse.
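The per-category accuracy described above is easiest to read from a cross-tabulation of true versus predicted classes. Because the 80/20 split above is random, here is a self-contained sketch with the iris data instead (iris stands in for the Boston test set):

```r
library(MASS)  # for lda()
set.seed(1)
idx <- sample(nrow(iris), size = 0.8 * nrow(iris))
fit <- lda(Species ~ ., data = iris[idx, ])
pred <- predict(fit, newdata = iris[-idx, ])
# rows: true classes, columns: predicted classes; the diagonal holds the hits
table(correct = iris[-idx, "Species"], predicted = pred$class)
```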
#I already saved the variable to crime1 before
#removing the crime variable
test <- dplyr::select(test, -crime)
# predict classes with test data
lda.pred <- predict(lda.fit, newdata = test)
lda.pred
## $class
## [1] (-0.411,-0.39] (-0.411,-0.39] (-0.39,0.00739] (-0.39,0.00739]
## [5] (-0.411,-0.39] [-0.419,-0.411] (-0.411,-0.39] (-0.411,-0.39]
## [9] (-0.411,-0.39] (-0.411,-0.39] (-0.411,-0.39] (-0.411,-0.39]
## [13] [-0.419,-0.411] (-0.411,-0.39] (-0.411,-0.39] (-0.39,0.00739]
## [17] (-0.39,0.00739] (-0.39,0.00739] (-0.411,-0.39] (-0.411,-0.39]
## [21] (-0.39,0.00739] (-0.39,0.00739] (-0.39,0.00739] (-0.39,0.00739]
## [25] (-0.39,0.00739] (-0.39,0.00739] (-0.411,-0.39] (-0.411,-0.39]
## [29] [-0.419,-0.411] [-0.419,-0.411] [-0.419,-0.411] [-0.419,-0.411]
## [33] (-0.411,-0.39] (-0.411,-0.39] (-0.411,-0.39] (-0.411,-0.39]
## [37] (-0.39,0.00739] (-0.39,0.00739] (-0.411,-0.39] (-0.39,0.00739]
## [41] (-0.39,0.00739] [-0.419,-0.411] (-0.411,-0.39] (-0.411,-0.39]
## [45] (-0.411,-0.39] (-0.411,-0.39] [-0.419,-0.411] (-0.39,0.00739]
## [49] (-0.39,0.00739] (-0.39,0.00739] [-0.419,-0.411] (-0.411,-0.39]
## [53] [-0.419,-0.411] [-0.419,-0.411] (-0.411,-0.39] [-0.419,-0.411]
## [57] [-0.419,-0.411] (-0.411,-0.39] (-0.411,-0.39] (-0.411,-0.39]
## [61] (-0.411,-0.39] (-0.411,-0.39] [-0.419,-0.411] (-0.411,-0.39]
## [65] [-0.419,-0.411] (-0.411,-0.39] [-0.419,-0.411] [-0.419,-0.411]
## [69] (0.00739,9.92] (0.00739,9.92] (0.00739,9.92] (0.00739,9.92]
## [73] (0.00739,9.92] (0.00739,9.92] (0.00739,9.92] (0.00739,9.92]
## [77] (0.00739,9.92] (0.00739,9.92] (0.00739,9.92] (0.00739,9.92]
## [81] (0.00739,9.92] (0.00739,9.92] (0.00739,9.92] (0.00739,9.92]
## [85] (0.00739,9.92] (0.00739,9.92] (0.00739,9.92] (0.00739,9.92]
## [89] (0.00739,9.92] (0.00739,9.92] (0.00739,9.92] (0.00739,9.92]
## [93] (0.00739,9.92] (0.00739,9.92] (0.00739,9.92] (0.00739,9.92]
## [97] (-0.39,0.00739] (-0.39,0.00739] (-0.39,0.00739] (-0.39,0.00739]
## [101] (-0.39,0.00739] (-0.39,0.00739]
## Levels: [-0.419,-0.411] (-0.411,-0.39] (-0.39,0.00739] (0.00739,9.92]
##
## $posterior
## [-0.419,-0.411] (-0.411,-0.39] (-0.39,0.00739] (0.00739,9.92]
## 6 3.415147e-01 5.935579e-01 6.492744e-02 1.340458e-21
## 8 2.294675e-02 4.928168e-01 4.842365e-01 3.374273e-16
## 22 7.741513e-02 4.014770e-01 5.211079e-01 6.362517e-17
## 23 6.263573e-02 4.559235e-01 4.814408e-01 5.742044e-17
## 45 2.454011e-01 7.383210e-01 1.627793e-02 2.115320e-22
## 56 9.953927e-01 3.881981e-03 7.253431e-04 1.940465e-20
## 65 4.674795e-01 4.685550e-01 6.396551e-02 2.950828e-22
## 70 3.452631e-01 6.470159e-01 7.720948e-03 4.222389e-21
## 71 3.135011e-01 6.787834e-01 7.715573e-03 4.528157e-22
## 74 2.880272e-01 7.047996e-01 7.173140e-03 6.492172e-22
## 89 3.974872e-01 5.125633e-01 8.994953e-02 2.350173e-22
## 91 3.741189e-01 5.763610e-01 4.952015e-02 2.605435e-22
## 93 5.851190e-01 3.687635e-01 4.611749e-02 4.092397e-19
## 98 3.900779e-01 4.451796e-01 1.647425e-01 9.383788e-23
## 100 4.164040e-01 5.190130e-01 6.458303e-02 5.132045e-23
## 101 8.059273e-02 3.117294e-01 6.076778e-01 5.802136e-16
## 105 7.722707e-02 4.584813e-01 4.642916e-01 2.235536e-15
## 109 6.979037e-02 4.054396e-01 5.247700e-01 1.850857e-15
## 115 6.016527e-02 5.586779e-01 3.811569e-01 6.748134e-15
## 117 5.215625e-02 5.595723e-01 3.882714e-01 4.830952e-15
## 123 2.196949e-02 2.517184e-01 7.263121e-01 6.772675e-19
## 129 7.752139e-03 6.970617e-02 9.225417e-01 4.651681e-16
## 143 5.406357e-05 2.538118e-03 9.974078e-01 3.777594e-15
## 144 5.842574e-05 1.982898e-03 9.979587e-01 4.809069e-14
## 148 5.042189e-05 2.669287e-03 9.972803e-01 1.621895e-13
## 171 1.111302e-02 3.664513e-01 6.224357e-01 2.163626e-15
## 177 1.769301e-01 7.416789e-01 8.139098e-02 3.723918e-18
## 186 1.817382e-01 6.505295e-01 1.677322e-01 1.453386e-19
## 190 8.275910e-01 1.566270e-01 1.578199e-02 3.211171e-19
## 202 9.899623e-01 9.905916e-03 1.318310e-04 1.682446e-24
## 203 9.955019e-01 4.176812e-03 3.213067e-04 1.941131e-25
## 205 9.958545e-01 2.543232e-03 1.602259e-03 6.486924e-21
## 209 6.678850e-02 7.685462e-01 1.646653e-01 1.397203e-19
## 212 1.549640e-02 7.875999e-01 1.969037e-01 8.388023e-18
## 217 4.487934e-02 6.755518e-01 2.795689e-01 4.868487e-18
## 219 1.261190e-02 4.967698e-01 4.906183e-01 9.851816e-17
## 224 3.674177e-02 4.180067e-01 5.452515e-01 1.897253e-12
## 228 4.155971e-02 3.440186e-01 6.144217e-01 8.798241e-13
## 230 1.933277e-01 6.182832e-01 1.883890e-01 2.229892e-14
## 232 5.044068e-02 3.412891e-01 6.082702e-01 2.794593e-13
## 238 5.907567e-02 3.712407e-01 5.696836e-01 1.291552e-13
## 239 6.656810e-01 3.269925e-01 7.326500e-03 6.142693e-19
## 242 3.127241e-01 6.560004e-01 3.127551e-02 9.041857e-17
## 243 3.752587e-01 5.940010e-01 3.074022e-02 2.290106e-17
## 246 8.139849e-02 8.256501e-01 9.295140e-02 3.322418e-14
## 249 2.681731e-01 6.436664e-01 8.816041e-02 1.253964e-15
## 251 5.277318e-01 4.510993e-01 2.116891e-02 2.366502e-17
## 258 1.427054e-02 8.267458e-03 9.774620e-01 3.617225e-17
## 267 4.537491e-02 8.045334e-02 8.741718e-01 4.447768e-16
## 268 6.475447e-02 6.238779e-02 8.728577e-01 2.090502e-17
## 274 5.899907e-01 2.760679e-01 1.339414e-01 2.735015e-22
## 281 3.671798e-01 3.946099e-01 2.382103e-01 2.015368e-18
## 282 5.207789e-01 4.432095e-01 3.601160e-02 1.129943e-19
## 286 9.523437e-01 4.746874e-02 1.875337e-04 3.360524e-27
## 298 3.117581e-02 9.488873e-01 1.993688e-02 2.107370e-20
## 306 6.181484e-01 2.467095e-01 1.351421e-01 5.752393e-14
## 307 5.521633e-01 1.730366e-01 2.748001e-01 6.367792e-14
## 309 1.907165e-01 4.413061e-01 3.679774e-01 1.590655e-18
## 313 8.982266e-02 5.387228e-01 3.714545e-01 2.197643e-17
## 317 4.875802e-02 5.920085e-01 3.592334e-01 1.911876e-17
## 323 2.308400e-01 6.727829e-01 9.637710e-02 4.488557e-18
## 324 1.038083e-01 7.363605e-01 1.598312e-01 8.477375e-17
## 333 8.776257e-01 1.213803e-01 9.940770e-04 2.612947e-26
## 337 2.578908e-01 5.992810e-01 1.428282e-01 8.868653e-18
## 342 8.716369e-01 1.214222e-01 6.940936e-03 4.840294e-26
## 346 1.778123e-01 8.066802e-01 1.550750e-02 1.049464e-22
## 353 9.381368e-01 6.113021e-02 7.330175e-04 1.642421e-22
## 356 9.938894e-01 5.732705e-03 3.779450e-04 1.598365e-21
## 373 5.871092e-22 1.565019e-18 3.311061e-14 1.000000e+00
## 376 1.074323e-18 7.727281e-16 8.993652e-13 1.000000e+00
## 378 6.374323e-20 1.222753e-16 1.270061e-13 1.000000e+00
## 386 5.432352e-22 3.728041e-18 2.920217e-15 1.000000e+00
## 390 3.808832e-21 1.072531e-17 1.054298e-14 1.000000e+00
## 399 1.357478e-21 8.558663e-18 5.065810e-15 1.000000e+00
## 401 1.035770e-20 3.662835e-17 2.517011e-14 1.000000e+00
## 406 1.354651e-20 3.524724e-17 1.880527e-14 1.000000e+00
## 409 8.801032e-22 7.254620e-18 2.137810e-15 1.000000e+00
## 412 5.177426e-22 1.599451e-18 1.610409e-15 1.000000e+00
## 414 4.519492e-22 2.345874e-18 6.604927e-16 1.000000e+00
## 417 8.484241e-22 1.791049e-18 4.101712e-15 1.000000e+00
## 421 1.021559e-20 1.130415e-17 4.717060e-14 1.000000e+00
## 428 7.165423e-21 5.488031e-18 7.797905e-15 1.000000e+00
## 431 1.587739e-20 3.275321e-17 9.172562e-15 1.000000e+00
## 432 1.252249e-20 2.969521e-17 1.249061e-14 1.000000e+00
## 435 2.028959e-21 2.002038e-18 7.723915e-15 1.000000e+00
## 437 5.860635e-22 5.367276e-19 4.420327e-15 1.000000e+00
## 447 6.401314e-21 7.994719e-18 4.413807e-14 1.000000e+00
## 450 9.547125e-21 1.520965e-17 4.467088e-14 1.000000e+00
## 454 1.333891e-19 1.353897e-16 7.089496e-13 1.000000e+00
## 457 2.367026e-22 3.568941e-19 1.817851e-15 1.000000e+00
## 460 5.635002e-20 6.871138e-17 1.718054e-13 1.000000e+00
## 463 1.437042e-19 1.429380e-16 3.501304e-13 1.000000e+00
## 475 5.016288e-20 2.027825e-16 1.953482e-14 1.000000e+00
## 476 1.995229e-20 1.077959e-16 1.901657e-14 1.000000e+00
## 477 2.310025e-19 5.841359e-16 1.835539e-13 1.000000e+00
## 484 2.485149e-16 3.296629e-13 4.977201e-12 1.000000e+00
## 489 4.987598e-03 2.539640e-01 7.410484e-01 1.551178e-15
## 490 4.183949e-03 3.352381e-01 6.605779e-01 3.684271e-15
## 496 6.476228e-02 4.641939e-01 4.710439e-01 7.008098e-15
## 498 4.448427e-02 3.681579e-01 5.873578e-01 2.402264e-14
## 504 2.763678e-01 1.829549e-01 5.406774e-01 8.665412e-22
## 505 2.938760e-01 2.159274e-01 4.901965e-01 8.763529e-22
##
## $x
## LD1 LD2 LD3
## 6 -3.1923558 -0.19235611 0.55757233
## 8 -1.7017992 -1.11573040 0.89525879
## 22 -1.9460934 -0.82246127 0.06102509
## 23 -1.9492054 -0.87653956 0.31952688
## 45 -3.3582353 0.18853735 1.50189171
## 56 -2.6700815 2.66807938 -2.80351207
## 65 -3.3685120 -0.19952400 0.19035785
## 70 -3.0220483 0.96362366 1.50060935
## 71 -3.2671900 0.72187712 1.60252851
## 74 -3.2223808 0.75712289 1.70913410
## 89 -3.3959409 -0.45191037 0.21715857
## 91 -3.3725529 -0.17277859 0.60085127
## 93 -2.5592283 0.71043712 -0.03758029
## 98 -3.5051516 -0.84114012 -0.14828303
## 100 -3.5610255 -0.40959614 0.34632271
## 101 -1.6962837 -0.68096654 -0.26433524
## 105 -1.5521191 -0.44108647 0.21942070
## 109 -1.5659412 -0.55910670 0.10796820
## 115 -1.4188340 -0.34156728 0.60642093
## 117 -1.4489229 -0.43785891 0.67444837
## 123 -2.3751293 -1.90193079 0.15287188
## 129 -1.5510858 -1.84037024 -0.62364003
## 143 -0.9353114 -3.66866145 -1.18986080
## 144 -0.6472415 -3.40657862 -1.46703151
## 148 -0.5149545 -3.35263213 -1.12124487
## 171 -1.4511507 -1.35803857 0.88193353
## 177 -2.2878879 0.16080943 0.97781387
## 186 -2.6612851 -0.48069748 0.55819333
## 190 -2.5499969 1.34996557 -0.56532696
## 202 -3.7080274 2.65269093 -1.20205272
## 203 -3.9365380 2.01683972 -2.36041812
## 205 -2.7938626 2.17721583 -3.51555420
## 209 -2.6182811 -0.87181774 1.23097529
## 212 -2.0897842 -1.16513418 1.91125587
## 217 -2.2087457 -0.96575172 1.08485298
## 219 -1.8076749 -1.47206988 1.20726749
## 224 -0.7621965 -0.19904890 0.42291744
## 228 -0.8497482 -0.27972440 0.13380723
## 230 -1.3360576 0.57743475 0.38883978
## 232 -0.9871677 -0.30274110 0.03585694
## 238 -1.0831097 -0.27806226 0.06259134
## 239 -2.4764510 1.70249371 0.53831517
## 242 -1.9377516 1.14782622 0.95423762
## 243 -2.0962766 1.10284469 0.78147905
## 246 -1.2410032 0.61884922 1.38590687
## 249 -1.6587436 0.81767562 0.58072798
## 251 -2.0925362 1.42353119 0.50594237
## 258 -1.7911176 -1.86724244 -2.91980370
## 267 -1.6538395 -1.11646995 -1.37090278
## 268 -2.0037204 -1.25481702 -1.77578523
## 274 -3.3860582 -0.47915014 -0.72007487
## 281 -2.3952518 -0.13555998 -0.41611606
## 282 -2.6976432 0.66884409 0.29789259
## 286 -4.4627895 1.90156420 0.14373769
## 298 -2.7502141 -0.30848792 2.68723285
## 306 -1.2514162 1.28530710 -0.92075880
## 307 -1.2365966 0.89978734 -1.48229067
## 309 -2.4004042 -0.63042330 -0.15581237
## 313 -2.0756837 -0.69321689 0.39952649
## 317 -2.0615102 -0.93157739 0.81279698
## 323 -2.2813178 0.19999811 0.68197550
## 324 -1.9261573 -0.09694184 0.95534856
## 333 -4.3007391 1.23996665 0.36178867
## 337 -2.2157215 0.11244318 0.35474103
## 342 -4.2740386 0.33842795 -0.43344910
## 346 -3.4213538 0.02081844 1.77059614
## 353 -3.2990508 2.21281081 -0.21284678
## 356 -2.9479569 2.76113027 -2.16616018
## 373 6.7872050 -0.99694417 -1.00907735
## 376 6.0904690 0.38159605 -0.46621758
## 378 6.3501803 0.21677689 0.07452078
## 386 6.8124329 0.16860787 0.82829745
## 390 6.6426308 0.31388871 0.28427126
## 399 6.7215254 0.26403985 0.90291435
## 401 6.5257833 0.28726605 0.55180010
## 406 6.5192966 0.53689875 0.49879805
## 409 6.7693099 0.51557425 1.32510363
## 412 6.8588431 0.43918816 0.31485557
## 414 6.8714449 0.82460635 1.10360148
## 417 6.8081830 0.17602550 -0.21605018
## 421 6.5558029 -0.03090925 -0.78512611
## 428 6.6401602 0.71057140 -0.53385330
## 431 6.5291753 0.95248778 0.64429422
## 432 6.5386267 0.70622936 0.54829157
## 435 6.7441134 0.21125179 -0.81773147
## 437 6.8701976 -0.01160263 -1.17149204
## 447 6.5946881 -0.18486679 -0.83938078
## 450 6.5497344 -0.02999175 -0.45401352
## 454 6.2698985 -0.33505477 -0.91331957
## 457 6.9525084 0.06403242 -0.72143944
## 460 6.3712015 0.01742138 -0.51904957
## 463 6.2793127 0.04141391 -0.61194664
## 475 6.3851258 1.04351680 1.43186937
## 476 6.4576824 0.68882095 1.32957774
## 477 6.2167619 0.55190812 0.71143257
## 484 5.5433458 1.72101321 1.64966562
## 489 -1.4361078 -1.79326447 0.88174886
## 490 -1.3381426 -1.72650710 1.27202238
## 496 -1.4163680 -0.41359798 0.31066903
## 498 -1.2556558 -0.55931575 0.19325743
## 504 -3.2328031 -1.36190332 -1.28944842
## 505 -3.2387285 -1.28776934 -1.12752834
# cross tabulate the results
table(correct = test$crime1, predicted = lda.pred$class)
## predicted
## correct [-0.419,-0.411] (-0.411,-0.39] (-0.39,0.00739] (0.00739,9.92]
## high 0 0 1 27
## low 12 11 1 0
## med_high 0 7 14 1
## med_low 5 15 8 0
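From a cross table like this, the overall classification accuracy is the sum of the diagonal divided by the total. A minimal sketch with a made-up table (the real `test` and `lda.pred` objects come from earlier chunks, so toy vectors stand in here):

```r
# toy confusion table: rows = correct class, columns = predicted class
tab <- table(correct   = c("high", "high", "low", "low", "low"),
             predicted = c("high", "low",  "low", "low", "high"))
# accuracy = correctly classified observations / all observations
accuracy <- sum(diag(tab)) / sum(tab)
accuracy
## [1] 0.6
```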
Next, I reload and standardize the Boston dataset.
Then I try to determine the optimal number of clusters. This can be done by looking at where the total within-cluster sum of squares drops radically as the number of clusters increases. Here that happens at two clusters.
Interpretation of the plot: again rad separates the clusters best (it cleanly splits the black and red dots), and it does so against most of the other variables. Tax could be the second best.
#reload and scale
data("Boston")
df <- scale(Boston)
#distances
dist_df <- dist(df)
summary(dist_df)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.1343 3.4625 4.8241 4.9111 6.1863 14.3970
#k-means
km <- kmeans(df, centers = 4)
# determine the number of clusters
k_max <- 10
# calculate the total within sum of squares
twcss <- sapply(1:k_max, function(k){kmeans(df, k)$tot.withinss})
# visualize the results
qplot(x = 1:k_max, y = twcss, geom = 'line')
## Warning: `qplot()` was deprecated in ggplot2 3.4.0.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
# k-means clustering
km <- kmeans(df, centers = 2)
# plot the Boston dataset with clusters
pairs(df, col = km$cluster)
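The "radical drop" can also be located programmatically: the largest decrease in the total within-cluster sum of squares when going from k to k+1 marks the elbow. A self-contained sketch on simulated data with two obvious clusters (the names `twcss` and `k_max` mirror the code above; the simulated `df_sim` is not the Boston data):

```r
set.seed(123)
# two well-separated clusters of 20 points each
df_sim <- rbind(matrix(rnorm(40, mean = 0),  ncol = 2),
                matrix(rnorm(40, mean = 10), ncol = 2))
k_max <- 6
# total within-cluster sum of squares for k = 1..k_max
twcss <- sapply(1:k_max, function(k) kmeans(df_sim, k, nstart = 10)$tot.withinss)
# the step with the largest drop marks the elbow
best_k <- which.max(-diff(twcss)) + 1
best_k
## [1] 2
```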
Bonus
First, I reload the Boston dataset and scale it.
In the biplot, LD1 (the first discriminant) on the x-axis shows a clear separation of the categories, and the arrow for rad is the longest, so it has the most influence on the separation between the categories.
# access the MASS package
library(MASS)
# load the data
data("Boston")
#scaling and forming a dataframe
scaled <- as.data.frame(scale(Boston))
# fit LDA with crime as the target (train was created earlier by splitting the scaled data)
lda_df <- lda(crime ~ zn + indus + chas + nox + rm + age + dis +
rad + tax + ptratio + black + lstat + medv, data = train)
# the function for arrows
lda.arrows <- function(x, myscale = 1, arrow_heads = 0.1, color = "red", tex = 0.75, choices = c(1,2)){
heads <- coef(x)
graphics::arrows(x0 = 0, y0 = 0,
x1 = myscale * heads[,choices[1]],
y1 = myscale * heads[,choices[2]], col=color, length = arrow_heads)
text(myscale * heads[,choices], labels = row.names(heads),
cex = tex, col=color, pos=3)
}
# target classes as numeric
classes <- as.numeric(scaled$crime)
# plot the lda results
plot(lda_df, dimen = 2)
lda.arrows(lda_df, myscale = 1)
First, let's set the row names to the country names and explore the data and distributions.
#setting rownames to countries
df <- read.csv("human.csv")
row.names(df) <- df$Country
df <- df[,2:9]
#exploring the data and distributions
library(GGally)
ggpairs(df)
library(corrplot)
cor(df) %>% corrplot()
Maternal mortality and life expectancy are highly correlated, as are adolescent birth rate and maternal mortality. There are also other very logical correlations, like life expectancy and expected years of schooling. The distributions of Ado.Birth, Mat.Mor, GNI and Life.Exp are highly skewed; the other distributions look more symmetric/normal.
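Such strongest pairs can also be picked out programmatically from the correlation matrix. A sketch using the built-in `mtcars` data (the `human` columns would work the same way):

```r
# correlation matrix of three mtcars variables
cm <- cor(mtcars[, c("mpg", "wt", "hp")])
# blank out the diagonal so the trivial self-correlations do not win
diag(cm) <- 0
# row/column index of the largest absolute correlation
idx <- which(abs(cm) == max(abs(cm)), arr.ind = TRUE)[1, ]
sort(rownames(cm)[idx])
## "mpg" "wt"
```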
Next, I perform principal component analysis, first on the standardized data.
#standardization
df_s <- scale(df)
# print out summaries of the standardized variables
summary(df_s)
## Edu.Ratio Labor.Ratio Edu.Exp Life.Exp
## Min. :-2.8189 Min. :-2.6247 Min. :-2.7378 Min. :-2.7188
## 1st Qu.:-0.5233 1st Qu.:-0.5484 1st Qu.:-0.6782 1st Qu.:-0.6425
## Median : 0.3503 Median : 0.2316 Median : 0.1140 Median : 0.3056
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.5958 3rd Qu.: 0.7350 3rd Qu.: 0.7126 3rd Qu.: 0.6717
## Max. : 2.6646 Max. : 1.6632 Max. : 2.4730 Max. : 1.4218
## GNI Mat.Mor Ado.Birth Parli.F
## Min. :-0.9193 Min. :-0.6992 Min. :-1.1325 Min. :-1.8203
## 1st Qu.:-0.7243 1st Qu.:-0.6496 1st Qu.:-0.8394 1st Qu.:-0.7409
## Median :-0.3013 Median :-0.4726 Median :-0.3298 Median :-0.1403
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.3712 3rd Qu.: 0.1932 3rd Qu.: 0.6030 3rd Qu.: 0.6127
## Max. : 5.6890 Max. : 4.4899 Max. : 3.8344 Max. : 3.1850
# perform principal component analysis (with the SVD method)
pca_df <- prcomp(df_s)
pca_df
## Standard deviations (1, .., p=8):
## [1] 2.0708380 1.1397204 0.8750485 0.7788630 0.6619563 0.5363061 0.4589994
## [8] 0.3222406
##
## Rotation (n x k) = (8 x 8):
## PC1 PC2 PC3 PC4 PC5
## Edu.Ratio -0.35664370 0.03796058 -0.24223089 0.62678110 -0.5983585
## Labor.Ratio 0.05457785 0.72432726 -0.58428770 0.06199424 0.2625067
## Edu.Exp -0.42766720 0.13940571 -0.07340270 -0.07020294 0.1659678
## Life.Exp -0.44372240 -0.02530473 0.10991305 -0.05834819 0.1628935
## GNI -0.35048295 0.05060876 -0.20168779 -0.72727675 -0.4950306
## Mat.Mor 0.43697098 0.14508727 -0.12522539 -0.25170614 -0.1800657
## Ado.Birth 0.41126010 0.07708468 0.01968243 0.04986763 -0.4672068
## Parli.F -0.08438558 0.65136866 0.72506309 0.01396293 -0.1523699
## PC6 PC7 PC8
## Edu.Ratio 0.17713316 0.05773644 0.16459453
## Labor.Ratio -0.03500707 -0.22729927 -0.07304568
## Edu.Exp -0.38606919 0.77962966 -0.05415984
## Life.Exp -0.42242796 -0.43406432 0.62737008
## GNI 0.11120305 -0.13711838 -0.16961173
## Mat.Mor 0.17370039 0.35380306 0.72193946
## Ado.Birth -0.76056557 -0.06897064 -0.14335186
## Parli.F 0.13749772 0.00568387 -0.02306476
# draw a biplot of the principal component representation and the original variables
biplot(pca_df, choices = 1:2, cex = c(0.8, 1), col = c("grey40", "deeppink2"))
Mat.Mor and Life.Exp have the largest loadings on the first principal component, which explains most of the variance. Labor.Ratio and Parli.F contribute the most to the second principal component.
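The share of variance each component explains can be computed directly from the standard deviations returned by `prcomp()`. A sketch on the standardized iris measurements (nothing here is taken from the `human` data):

```r
pca <- prcomp(scale(iris[, 1:4]))
# variance of each PC as a proportion of the total variance
prop <- pca$sdev^2 / sum(pca$sdev^2)
round(prop, 3)
## 0.730 0.229 0.037 0.005
```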
Next, let's do the same with the non-standardized data.
pca_df_n <- prcomp(df)
pca_df_n
## Standard deviations (1, .., p=8):
## [1] 1.854416e+04 1.855219e+02 2.518701e+01 1.145441e+01 3.766241e+00
## [6] 1.565912e+00 1.912052e-01 1.591112e-01
##
## Rotation (n x k) = (8 x 8):
## PC1 PC2 PC3 PC4
## Edu.Ratio -5.607472e-06 0.0006713951 -3.412027e-05 -2.736326e-04
## Labor.Ratio 2.331945e-07 -0.0002819357 5.302884e-04 -4.692578e-03
## Edu.Exp -9.562910e-05 0.0075529759 1.427664e-02 -3.313505e-02
## Life.Exp -2.815823e-04 0.0283150248 1.294971e-02 -6.752684e-02
## GNI -9.999832e-01 -0.0057723054 -5.156742e-04 4.932889e-05
## Mat.Mor 5.655734e-03 -0.9916320120 1.260302e-01 -6.100534e-03
## Ado.Birth 1.233961e-03 -0.1255502723 -9.918113e-01 5.301595e-03
## Parli.F -5.526460e-05 0.0032317269 -7.398331e-03 -9.971232e-01
## PC5 PC6 PC7 PC8
## Edu.Ratio -0.0022935252 2.180183e-02 6.998623e-01 7.139410e-01
## Labor.Ratio 0.0022190154 3.264423e-02 7.132267e-01 -7.001533e-01
## Edu.Exp 0.1431180282 9.882477e-01 -3.826887e-02 7.776451e-03
## Life.Exp 0.9865644425 -1.453515e-01 5.380452e-03 2.281723e-03
## GNI -0.0001135863 -2.711698e-05 -8.075191e-07 -1.176762e-06
## Mat.Mor 0.0266373214 1.695203e-03 1.355518e-04 8.371934e-04
## Ado.Birth 0.0188618600 1.273198e-02 -8.641234e-05 -1.707885e-04
## Parli.F -0.0716401914 -2.309896e-02 -2.642548e-03 2.680113e-03
biplot(pca_df_n, choices = 1:2, cex = c(0.8, 1), col = c("grey40", "deeppink2"))
## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length =
## arrow.len): zero-length arrow is of indeterminate angle and so skipped
## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length =
## arrow.len): zero-length arrow is of indeterminate angle and so skipped
## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length =
## arrow.len): zero-length arrow is of indeterminate angle and so skipped
## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length =
## arrow.len): zero-length arrow is of indeterminate angle and so skipped
## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length =
## arrow.len): zero-length arrow is of indeterminate angle and so skipped
summary(df)
## Edu.Ratio Labor.Ratio Edu.Exp Life.Exp
## Min. :0.1717 Min. :0.1857 Min. : 5.40 Min. :49.00
## 1st Qu.:0.7264 1st Qu.:0.5984 1st Qu.:11.25 1st Qu.:66.30
## Median :0.9375 Median :0.7535 Median :13.50 Median :74.20
## Mean :0.8529 Mean :0.7074 Mean :13.18 Mean :71.65
## 3rd Qu.:0.9968 3rd Qu.:0.8535 3rd Qu.:15.20 3rd Qu.:77.25
## Max. :1.4967 Max. :1.0380 Max. :20.20 Max. :83.50
## GNI Mat.Mor Ado.Birth Parli.F
## Min. : 581 Min. : 1.0 Min. : 0.60 Min. : 0.00
## 1st Qu.: 4198 1st Qu.: 11.5 1st Qu.: 12.65 1st Qu.:12.40
## Median : 12040 Median : 49.0 Median : 33.60 Median :19.30
## Mean : 17628 Mean : 149.1 Mean : 47.16 Mean :20.91
## 3rd Qu.: 24512 3rd Qu.: 190.0 3rd Qu.: 71.95 3rd Qu.:27.95
## Max. :123124 Max. :1100.0 Max. :204.80 Max. :57.50
The PCA results and plots are very different. Because the data is non-standardized, GNI, which has by far the largest scale and therefore the largest variance/standard deviation, dominates the principal components.
The standardized analysis suggests that since Edu.Ratio, Edu.Exp, Life.Exp, GNI, Mat.Mor and Ado.Birth have loadings of similar magnitude on the first component, they also tend to correlate with each other: when one variable changes, so do the others, in the same or the opposite direction depending on the sign. The same holds for Parli.F and Labor.Ratio, which load highest on the second component.
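The dominance of the large-scale variable is easy to demonstrate: without standardization, the first principal component is essentially the variable with the largest variance. A sketch with hypothetical data (`small` and `big` are made-up columns):

```r
set.seed(1)
# one variable on a unit scale, one on a huge scale
d <- data.frame(small = rnorm(100), big = 1000 * rnorm(100))
p <- prcomp(d)  # no scaling
# PC1 loads almost entirely on the large-scale variable
abs(p$rotation["big", "PC1"])  # very close to 1
```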
Next, let’s work with the tea data.
#loading the data
tea <- read.csv("https://raw.githubusercontent.com/KimmoVehkalahti/Helsinki-Open-Data-Science/master/datasets/tea.csv", stringsAsFactors = TRUE)
library(FactoMineR)
library(ggplot2)
#exploring the data
str(tea)
## 'data.frame': 300 obs. of 36 variables:
## $ breakfast : Factor w/ 2 levels "breakfast","Not.breakfast": 1 1 2 2 1 2 1 2 1 1 ...
## $ tea.time : Factor w/ 2 levels "Not.tea time",..: 1 1 2 1 1 1 2 2 2 1 ...
## $ evening : Factor w/ 2 levels "evening","Not.evening": 2 2 1 2 1 2 2 1 2 1 ...
## $ lunch : Factor w/ 2 levels "lunch","Not.lunch": 2 2 2 2 2 2 2 2 2 2 ...
## $ dinner : Factor w/ 2 levels "dinner","Not.dinner": 2 2 1 1 2 1 2 2 2 2 ...
## $ always : Factor w/ 2 levels "always","Not.always": 2 2 2 2 1 2 2 2 2 2 ...
## $ home : Factor w/ 2 levels "home","Not.home": 1 1 1 1 1 1 1 1 1 1 ...
## $ work : Factor w/ 2 levels "Not.work","work": 1 1 2 1 1 1 1 1 1 1 ...
## $ tearoom : Factor w/ 2 levels "Not.tearoom",..: 1 1 1 1 1 1 1 1 1 2 ...
## $ friends : Factor w/ 2 levels "friends","Not.friends": 2 2 1 2 2 2 1 2 2 2 ...
## $ resto : Factor w/ 2 levels "Not.resto","resto": 1 1 2 1 1 1 1 1 1 1 ...
## $ pub : Factor w/ 2 levels "Not.pub","pub": 1 1 1 1 1 1 1 1 1 1 ...
## $ Tea : Factor w/ 3 levels "black","Earl Grey",..: 1 1 2 2 2 2 2 1 2 1 ...
## $ How : Factor w/ 4 levels "alone","lemon",..: 1 3 1 1 1 1 1 3 3 1 ...
## $ sugar : Factor w/ 2 levels "No.sugar","sugar": 2 1 1 2 1 1 1 1 1 1 ...
## $ how : Factor w/ 3 levels "tea bag","tea bag+unpackaged",..: 1 1 1 1 1 1 1 1 2 2 ...
## $ where : Factor w/ 3 levels "chain store",..: 1 1 1 1 1 1 1 1 2 2 ...
## $ price : Factor w/ 6 levels "p_branded","p_cheap",..: 4 6 6 6 6 3 6 6 5 5 ...
## $ age : int 39 45 47 23 48 21 37 36 40 37 ...
## $ sex : Factor w/ 2 levels "F","M": 2 1 1 2 2 2 2 1 2 2 ...
## $ SPC : Factor w/ 7 levels "employee","middle",..: 2 2 4 6 1 6 5 2 5 5 ...
## $ Sport : Factor w/ 2 levels "Not.sportsman",..: 2 2 2 1 2 2 2 2 2 1 ...
## $ age_Q : Factor w/ 5 levels "+60","15-24",..: 4 5 5 2 5 2 4 4 4 4 ...
## $ frequency : Factor w/ 4 levels "+2/day","1 to 2/week",..: 3 3 1 3 1 3 4 2 1 1 ...
## $ escape.exoticism: Factor w/ 2 levels "escape-exoticism",..: 2 1 2 1 1 2 2 2 2 2 ...
## $ spirituality : Factor w/ 2 levels "Not.spirituality",..: 1 1 1 2 2 1 1 1 1 1 ...
## $ healthy : Factor w/ 2 levels "healthy","Not.healthy": 1 1 1 1 2 1 1 1 2 1 ...
## $ diuretic : Factor w/ 2 levels "diuretic","Not.diuretic": 2 1 1 2 1 2 2 2 2 1 ...
## $ friendliness : Factor w/ 2 levels "friendliness",..: 2 2 1 2 1 2 2 1 2 1 ...
## $ iron.absorption : Factor w/ 2 levels "iron absorption",..: 2 2 2 2 2 2 2 2 2 2 ...
## $ feminine : Factor w/ 2 levels "feminine","Not.feminine": 2 2 2 2 2 2 2 1 2 2 ...
## $ sophisticated : Factor w/ 2 levels "Not.sophisticated",..: 1 1 1 2 1 1 1 2 2 1 ...
## $ slimming : Factor w/ 2 levels "No.slimming",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ exciting : Factor w/ 2 levels "exciting","No.exciting": 2 1 2 2 2 2 2 2 2 2 ...
## $ relaxing : Factor w/ 2 levels "No.relaxing",..: 1 1 2 2 2 2 2 2 2 2 ...
## $ effect.on.health: Factor w/ 2 levels "effect on health",..: 2 2 2 2 2 2 2 2 2 2 ...
dim(tea)
## [1] 300 36
#take the first 10 variables
tea <- tea[,1:10]
#performing the analysis and giving the summary
mca <- MCA(tea, graph = FALSE)
summary(mca)
##
## Call:
## MCA(X = tea, graph = FALSE)
##
##
## Eigenvalues
## Dim.1 Dim.2 Dim.3 Dim.4 Dim.5 Dim.6 Dim.7
## Variance 0.179 0.129 0.115 0.107 0.104 0.090 0.079
## % of var. 17.939 12.871 11.481 10.654 10.406 8.985 7.917
## Cumulative % of var. 17.939 30.810 42.291 52.945 63.351 72.337 80.253
## Dim.8 Dim.9 Dim.10
## Variance 0.076 0.068 0.054
## % of var. 7.559 6.793 5.395
## Cumulative % of var. 87.812 94.605 100.000
##
## Individuals (the 10 first)
## Dim.1 ctr cos2 Dim.2 ctr cos2 Dim.3 ctr
## 1 | -0.543 0.549 0.474 | -0.462 0.552 0.342 | -0.007 0.000
## 2 | -0.543 0.549 0.474 | -0.462 0.552 0.342 | -0.007 0.000
## 3 | -0.068 0.009 0.002 | 0.542 0.761 0.141 | -0.139 0.056
## 4 | -1.037 1.998 0.558 | 0.264 0.180 0.036 | -0.018 0.001
## 5 | -0.235 0.103 0.061 | -0.011 0.000 0.000 | 0.124 0.044
## 6 | -1.037 1.998 0.558 | 0.264 0.180 0.036 | -0.018 0.001
## 7 | 0.024 0.001 0.001 | -0.404 0.422 0.374 | -0.185 0.099
## 8 | -0.193 0.069 0.054 | 0.058 0.009 0.005 | -0.382 0.423
## 9 | -0.258 0.124 0.117 | -0.624 1.007 0.680 | -0.130 0.049
## 10 | -0.048 0.004 0.002 | -0.153 0.061 0.020 | -0.176 0.090
## cos2
## 1 0.000 |
## 2 0.000 |
## 3 0.009 |
## 4 0.000 |
## 5 0.017 |
## 6 0.000 |
## 7 0.078 |
## 8 0.210 |
## 9 0.030 |
## 10 0.027 |
##
## Categories (the 10 first)
## Dim.1 ctr cos2 v.test Dim.2 ctr cos2
## breakfast | 0.209 1.168 0.040 3.471 | -0.770 22.090 0.547
## Not.breakfast | -0.193 1.078 0.040 -3.471 | 0.710 20.391 0.547
## Not.tea time | -0.680 11.253 0.358 -10.351 | 0.327 3.635 0.083
## tea time | 0.527 8.723 0.358 10.351 | -0.254 2.817 0.083
## evening | 0.445 3.787 0.103 5.562 | 0.634 10.728 0.210
## Not.evening | -0.233 1.980 0.103 -5.562 | -0.332 5.609 0.210
## lunch | 0.947 7.329 0.154 6.788 | 0.167 0.317 0.005
## Not.lunch | -0.163 1.260 0.154 -6.788 | -0.029 0.054 0.005
## dinner | -1.571 9.633 0.186 -7.454 | 1.044 5.925 0.082
## Not.dinner | 0.118 0.725 0.186 7.454 | -0.079 0.446 0.082
## v.test Dim.3 ctr cos2 v.test
## breakfast -12.786 | -0.031 0.041 0.001 -0.518 |
## Not.breakfast 12.786 | 0.029 0.038 0.001 0.518 |
## Not.tea time 4.983 | 0.235 2.102 0.043 3.579 |
## tea time -4.983 | -0.182 1.630 0.043 -3.579 |
## evening 7.929 | -0.599 10.742 0.188 -7.494 |
## Not.evening -7.929 | 0.313 5.616 0.188 7.494 |
## lunch 1.195 | -1.133 16.410 0.221 -8.125 |
## Not.lunch -1.195 | 0.195 2.820 0.221 8.125 |
## dinner 4.951 | -0.090 0.050 0.001 -0.428 |
## Not.dinner -4.951 | 0.007 0.004 0.001 0.428 |
##
## Categorical variables (eta2)
## Dim.1 Dim.2 Dim.3
## breakfast | 0.040 0.547 0.001 |
## tea.time | 0.358 0.083 0.043 |
## evening | 0.103 0.210 0.188 |
## lunch | 0.154 0.005 0.221 |
## dinner | 0.186 0.082 0.001 |
## always | 0.089 0.096 0.414 |
## home | 0.008 0.114 0.005 |
## work | 0.215 0.006 0.251 |
## tearoom | 0.315 0.003 0.018 |
## friends | 0.325 0.141 0.008 |
plot(mca, invisible=c("ind"), graph.type = "classic")
plot(mca, invisible=c("ind"), graph.type = c("ggplot")) + theme(panel.grid.major = element_blank(),
plot.title=element_text(size=14, color="blue"),
axis.title = element_text(size=12, color="red"))
## Warning: ggrepel: 3 unlabeled data points (too many overlaps). Consider
## increasing max.overlaps
Tea.time, work, tearoom and friends are strongly associated with the first dimension and as such tend to vary together. Breakfast and evening are strongly associated with the second dimension, while lunch, always and work are strongly associated with the third dimension.
In the first part, I will analyse the RATS data by providing graphical summaries and using the summary measure approach.
#read the data
RATS <- read.csv("RATS.csv")
#summary of the rats data
summary(RATS)
## X ID Group days
## Min. : 1.00 Min. : 1.00 Min. :1.00 Length:176
## 1st Qu.: 44.75 1st Qu.: 4.75 1st Qu.:1.00 Class :character
## Median : 88.50 Median : 8.50 Median :1.50 Mode :character
## Mean : 88.50 Mean : 8.50 Mean :1.75
## 3rd Qu.:132.25 3rd Qu.:12.25 3rd Qu.:2.25
## Max. :176.00 Max. :16.00 Max. :3.00
## RATS Time
## Min. :225.0 Min. : 1.00
## 1st Qu.:267.0 1st Qu.:15.00
## Median :344.5 Median :36.00
## Mean :384.5 Mean :33.55
## 3rd Qu.:511.2 3rd Qu.:50.00
## Max. :628.0 Max. :64.00
#Access the package ggplot2
library(ggplot2)
RATS$Group <- factor(RATS$Group)
RATS$ID <- factor(RATS$ID)
# Draw the plot
ggplot(RATS, aes(x = Time, y = RATS, linetype = ID)) +
geom_line() +
scale_linetype_manual(values = rep(1:10, times=4)) +
facet_grid(. ~ Group, labeller = label_both) +
theme(legend.position = "none") +
scale_y_continuous(limits = c(min(RATS$RATS), max(RATS$RATS)))
I started by loading the data. The dataset contains longitudinal information on rats’ weights.
It seems that in groups 2 and 3 the starting weight is much higher than in group 1. The rate of increase seems to be about the same.
Next, I will compute the group means at each time point and plot them.
# Standardize the weights within each time point
library(dplyr)
library(tidyr)
RATSS <- RATS %>%
group_by(Time) %>%
mutate( stdRATS = (RATS - mean(RATS))/sd(RATS)) %>%
ungroup()
#number of participants per group
RATSS$n <- ifelse(RATSS$Group == 1, 8, 4)
#calculate standard error
RATSS1 <- RATSS %>%
group_by(Group, Time) %>%
summarise(mean = mean(RATS), se = sd(RATS)/sqrt(n) )%>%
ungroup()
## Warning: Returning more (or less) than 1 row per `summarise()` group was deprecated in
## dplyr 1.1.0.
## ℹ Please use `reframe()` instead.
## ℹ When switching from `summarise()` to `reframe()`, remember that `reframe()`
## always returns an ungrouped data frame and adjust accordingly.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
## `summarise()` has grouped output by 'Group', 'Time'. You can override using the
## `.groups` argument.
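The deprecation warning above comes from mixing a precomputed, row-length `n` column into `summarise()`. Computing the group size with `n()` inside the call avoids both the warning and the hard-coded group sizes. A sketch on toy data (same `Group`/`Time`/`RATS` column names, made-up values):

```r
library(dplyr)
toy <- data.frame(Group = rep(c(1, 2), each = 4),
                  Time  = rep(c(1, 8), times = 4),
                  RATS  = c(240, 250, 244, 258, 440, 455, 442, 461))
# n() counts the rows of each Group/Time cell, so no separate n column is needed
toy_se <- toy %>%
  group_by(Group, Time) %>%
  summarise(mean = mean(RATS), se = sd(RATS) / sqrt(n()), .groups = "drop")
toy_se
```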
# Plot the mean profiles
library(ggplot2)
ggplot(RATSS1, aes(x = Time, y = mean, linetype = Group, shape = Group)) +
geom_line() +
scale_linetype_manual(values = c(1,2,3)) +
geom_point(size=3) +
scale_shape_manual(values = c(1,2,3)) +
geom_errorbar(aes(ymin=mean-se, ymax=mean+se, linetype="1"), width=0.3) +
theme(legend.position = c(0.8,0.8)) +
scale_y_continuous(name = "mean(RATS) +/- se(RATS)")
Group 1 has smaller standard errors than the other groups. This plot does not give much more information than the previous one; groups 2 and 3 probably increase a little faster than group 1.
Next, checking the data for outliers.
# Create summary data by group and rat with the mean weight as the summary variable (ignoring the day-1 baseline)
RATS1 <- RATS %>%
filter(Time > 1) %>%
group_by(Group, ID) %>%
summarise( mean=mean(RATS) ) %>%
ungroup()
## `summarise()` has grouped output by 'Group'. You can override using the
## `.groups` argument.
ggplot(RATS1, aes(x = Group, y = mean)) +
geom_boxplot() +
stat_summary(fun = "mean", geom = "point", shape=23, size=4, fill = "white") +
scale_y_continuous(name = "mean(RATS), days after baseline")
Well, in group 2 there is one outlier, but it does not seem to be too far off.
A t-test should not be performed as there are three groups, so let's do an ANOVA instead.
# Add the baseline from the original data as a new variable to the summary data
RATS2 <- RATS1 %>%
mutate(baseline = RATS[RATS$Time == 1,]$RATS)
# Fit the linear model with the mean as the response
fit <- lm(mean ~ baseline + Group, data = RATS2)
anova(fit)
## Analysis of Variance Table
##
## Response: mean
## Df Sum Sq Mean Sq F value Pr(>F)
## baseline 1 253625 253625 1859.8201 1.57e-14 ***
## Group 2 879 439 3.2219 0.07586 .
## Residuals 12 1636 136
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Baseline weight is a strongly significant predictor of the later mean weight, but after adjusting for the baseline the groups do not differ significantly (p ≈ 0.076).
First, load the data and plot the ratings per week.
#read the data
BPRS <- read.csv("BPRS.csv")
#summary of the BPRS data
summary(BPRS)
## X treatment subject weeks
## Min. : 1.00 Min. :1.0 Min. : 1.00 Length:360
## 1st Qu.: 90.75 1st Qu.:1.0 1st Qu.: 5.75 Class :character
## Median :180.50 Median :1.5 Median :10.50 Mode :character
## Mean :180.50 Mean :1.5 Mean :10.50
## 3rd Qu.:270.25 3rd Qu.:2.0 3rd Qu.:15.25
## Max. :360.00 Max. :2.0 Max. :20.00
## BPRS week
## Min. :18.00 Min. :0
## 1st Qu.:27.00 1st Qu.:2
## Median :35.00 Median :4
## Mean :37.66 Mean :4
## 3rd Qu.:43.00 3rd Qu.:6
## Max. :95.00 Max. :8
#Access the package ggplot2
library(ggplot2)
BPRS$treatment <- factor(BPRS$treatment)
BPRS$subject <- factor(BPRS$subject)
# Draw the plot
ggplot(BPRS, aes(x = week, y = BPRS, linetype = subject)) +
geom_line() +
scale_linetype_manual(values = rep(1:10, times=4)) +
facet_grid(. ~ treatment, labeller = label_both) +
theme(legend.position = "none") +
scale_y_continuous(limits = c(min(BPRS$BPRS), max(BPRS$BPRS)))
Ratings seem to be decreasing in both groups, and the variability is also decreasing.
Next let’s create the regression model.
model1 <- lm(BPRS ~ treatment + week, data = BPRS)
model1
##
## Call:
## lm(formula = BPRS ~ treatment + week, data = BPRS)
##
## Coefficients:
## (Intercept) treatment2 week
## 46.4539 0.5722 -2.2704
The coefficient for week is clearly negative while the coefficient for treatment is small: ratings decrease strongly over time, and treatment group 2 has slightly higher scores.
Next, The Random Intercept Model.
# access library lme4
library(lme4)
## Loading required package: Matrix
##
## Attaching package: 'Matrix'
## The following objects are masked from 'package:tidyr':
##
## expand, pack, unpack
# Create a random intercept model
BPRS_ref <- lmer(BPRS ~ week + treatment + (1 | subject), data = BPRS, REML = FALSE)
summary(BPRS_ref)
## Linear mixed model fit by maximum likelihood ['lmerMod']
## Formula: BPRS ~ week + treatment + (1 | subject)
## Data: BPRS
##
## AIC BIC logLik deviance df.resid
## 2748.7 2768.1 -1369.4 2738.7 355
##
## Scaled residuals:
## Min 1Q Median 3Q Max
## -3.0481 -0.6749 -0.1361 0.4813 3.4855
##
## Random effects:
## Groups Name Variance Std.Dev.
## subject (Intercept) 47.41 6.885
## Residual 104.21 10.208
## Number of obs: 360, groups: subject, 20
##
## Fixed effects:
## Estimate Std. Error t value
## (Intercept) 46.4539 1.9090 24.334
## week -2.2704 0.2084 -10.896
## treatment2 0.5722 1.0761 0.532
##
## Correlation of Fixed Effects:
## (Intr) week
## week -0.437
## treatment2 -0.282 0.000
The fixed-effect estimates are the same as in the simple linear model, but the standard errors now account for the within-subject variation captured by the random intercept.
Next, Random Slope Model and comparing it to the intercept model.
#random slope model
BPRS_ref1 <- lmer(BPRS ~ week + treatment + (week | subject), data = BPRS, REML = FALSE)
#comparison with anova
anova(BPRS_ref, BPRS_ref1)
## Data: BPRS
## Models:
## BPRS_ref: BPRS ~ week + treatment + (1 | subject)
## BPRS_ref1: BPRS ~ week + treatment + (week | subject)
## npar AIC BIC logLik deviance Chisq Df Pr(>Chisq)
## BPRS_ref 5 2748.7 2768.1 -1369.4 2738.7
## BPRS_ref1 7 2745.4 2772.6 -1365.7 2731.4 7.2721 2 0.02636 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
summary(BPRS_ref1)
## Linear mixed model fit by maximum likelihood ['lmerMod']
## Formula: BPRS ~ week + treatment + (week | subject)
## Data: BPRS
##
## AIC BIC logLik deviance df.resid
## 2745.4 2772.6 -1365.7 2731.4 353
##
## Scaled residuals:
## Min 1Q Median 3Q Max
## -2.8919 -0.6194 -0.0691 0.5531 3.7976
##
## Random effects:
## Groups Name Variance Std.Dev. Corr
## subject (Intercept) 64.8222 8.0512
## week 0.9609 0.9802 -0.51
## Residual 97.4305 9.8707
## Number of obs: 360, groups: subject, 20
##
## Fixed effects:
## Estimate Std. Error t value
## (Intercept) 46.4539 2.1052 22.066
## week -2.2704 0.2977 -7.626
## treatment2 0.5722 1.0405 0.550
##
## Correlation of Fixed Effects:
## (Intr) week
## week -0.582
## treatment2 -0.247 0.000
The p-value is 0.026, meaning that the random slope model fits the data significantly better than the random intercept model.
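The `anova()` comparison is a likelihood-ratio test: the chi-square statistic is the drop in deviance (twice the difference in log-likelihoods), compared against a chi-square distribution whose degrees of freedom are the number of extra parameters. Recomputing the reported p-value by hand from the values in the table above:

```r
chisq <- 7.2721   # deviance drop: 2738.7 - 2731.4 (before rounding)
df    <- 2        # the random slope model has 7 - 5 = 2 extra parameters
pchisq(chisq, df = df, lower.tail = FALSE)
## [1] 0.02636
```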
Next, let's add an interaction between time and treatment group.
BPRS_ref2 <- lmer(BPRS ~ week + treatment + (week | subject) + week*treatment, data = BPRS, REML = FALSE)
anova(BPRS_ref1, BPRS_ref2)
## Data: BPRS
## Models:
## BPRS_ref1: BPRS ~ week + treatment + (week | subject)
## BPRS_ref2: BPRS ~ week + treatment + (week | subject) + week * treatment
## npar AIC BIC logLik deviance Chisq Df Pr(>Chisq)
## BPRS_ref1 7 2745.4 2772.6 -1365.7 2731.4
## BPRS_ref2 8 2744.3 2775.4 -1364.1 2728.3 3.1712 1 0.07495 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
summary(BPRS_ref2)
## Linear mixed model fit by maximum likelihood ['lmerMod']
## Formula: BPRS ~ week + treatment + (week | subject) + week * treatment
## Data: BPRS
##
## AIC BIC logLik deviance df.resid
## 2744.3 2775.4 -1364.1 2728.3 352
##
## Scaled residuals:
## Min 1Q Median 3Q Max
## -3.0512 -0.6271 -0.0768 0.5288 3.9260
##
## Random effects:
## Groups Name Variance Std.Dev. Corr
## subject (Intercept) 64.9964 8.0620
## week 0.9687 0.9842 -0.51
## Residual 96.4707 9.8220
## Number of obs: 360, groups: subject, 20
##
## Fixed effects:
## Estimate Std. Error t value
## (Intercept) 47.8856 2.2521 21.262
## week -2.6283 0.3589 -7.323
## treatment2 -2.2911 1.9090 -1.200
## week:treatment2 0.7158 0.4010 1.785
##
## Correlation of Fixed Effects:
## (Intr) week trtmn2
## week -0.650
## treatment2 -0.424 0.469
## wek:trtmnt2 0.356 -0.559 -0.840
The interaction model fits slightly better, but the improvement is not significant (p ≈ 0.075).
Last, plotting the raw and fitted values.
#raw values
ggplot(BPRS, aes(x = week, y = BPRS, group = subject)) +
geom_line() +
scale_x_continuous(name = "Time (weeks)", breaks = seq(0, 8, 2)) +
scale_y_continuous(name = "scores") +
theme(legend.position = "top")
Fitted <- fitted(BPRS_ref2)
BPRS$fitted <- Fitted
#with fitted values
ggplot(BPRS, aes(x = week, y = fitted, group = subject)) +
geom_line(aes(linetype = subject)) +
scale_x_continuous(name = "Time (weeks)", breaks = seq(0, 8, 2)) +
scale_y_continuous(name = "scores") +
theme(legend.position = "top")
The fitted values produce a clearer plot, as each line is the model's prediction for that subject given time and treatment group. The scores decrease as time goes on.